1990-04-13
-- card: 24224 from stack: in.01
-- bmap block id: 0
-- flags: 0000
-- background id: 7910
-- name:
-- part contents for background part 4
----- text -----
/* part 2 of 2 */
EXTENSIONS AND ENHANCEMENTS
As mentioned repeatedly above, some significant yet straightforward
extensions to my current free text IR systems are necessary in order to
properly handle 100 MB/day of data. Here, I will briefly sketch out how
I plan to attack the key problems during the coming months. I assume
that ample physical storage is available to hold the influx of
information online, in a form which allows access to any item in a
fraction of a second.
My systems have to be modified to handle multiple text files as a single
database. I propose to do this by adding a third index file to the
"keys" and "pointers" files -- a "filelist" index file which will simply
contain a list of the database document files along with their lengths.
The structure of the "keys" and "pointers" files will remain unchanged
(which should maximize compatibility with earlier index programs and
minimize the number of new bugs introduced in this step). The index
building programs will treat each of the documents in the "filelist"
file as part of a single big document for indexing purposes, and the
index browsing programs will consult the "filelist" in order to know
where to go to retrieve lines of context or chunks of full text for
display.
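The "filelist" idea can be sketched in C roughly as follows. This is an
illustrative sketch only (the struct and function names are mine, not
from the actual programs): document files are treated as one big
virtual text, and a global byte offset from the "pointers" file is
resolved to a particular file and a local offset within it.

```c
/* Sketch of the proposed "filelist" index (names hypothetical). */
#include <stdio.h>
#include <string.h>

#define MAXFILES 16

struct filelist {
    int  nfiles;
    char name[MAXFILES][64];  /* document file names              */
    long length[MAXFILES];    /* length of each file in bytes     */
    long start[MAXFILES];     /* cumulative offset of file start  */
};

/* add a document file to the list, keeping cumulative offsets */
void fl_add(struct filelist *fl, const char *name, long length)
{
    int i = fl->nfiles++;
    strcpy(fl->name[i], name);
    fl->length[i] = length;
    fl->start[i] = (i == 0) ? 0L : fl->start[i - 1] + fl->length[i - 1];
}

/* resolve a global offset to a file index and local offset;
   returns -1 if the offset lies beyond the whole database */
int fl_resolve(const struct filelist *fl, long global, long *local)
{
    for (int i = fl->nfiles - 1; i >= 0; i--)
        if (global >= fl->start[i]) {
            if (global >= fl->start[i] + fl->length[i]) return -1;
            *local = global - fl->start[i];
            return i;
        }
    return -1;
}
```

The browsing programs would call fl_resolve before fetching context
lines, opening whichever document file the answer names.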
A drawback of this multifile "virtual merge" approach is that it will at
times be necessary to open and close files during browsing operations. I
have not yet run tests, and do not know what penalties the operating
system will impose (one hopes only a few milliseconds?) every time a
file is opened and closed. With the use of modern operating system RAM
caches, I hope that speed will not be a problem. During typical
browsing operations, I believe that most database references will be
predictable and localized, so caching should help average performance a
lot.
Another extension which I plan to implement is to add facilities to
rapidly merge separate index files upon user demand. Merging
already-sorted index files is a very fast operation which should be
limited only by disk I/O rates. It will then be possible to keep
indices for each day's (or hour's, or whatever) collection of data
separate until the time at which a user wants to browse a chosen set of
files. The delay to merge the selected separate indices will be about
equal to the time required for a single sequential scan through the
chosen database. After that start-up delay, searches will progress at
the normal full speed (sub-second response time for simple queries).
Many data collections which are commonly referred to as a unit can have
their merged index files kept online to avoid any search delays.
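Why merging is cheap can be seen from a sketch. Under the simplifying
assumption that each daily index is just a sorted list of postings
(the real index files have more structure), combining two of them
takes a single sequential pass, which is why the cost is essentially
one scan of the chosen data:

```c
/* Sketch: merge two already-sorted posting lists in one pass. */
#include <stddef.h>

/* merge sorted arrays a[na] and b[nb] into out[]; returns count */
size_t merge_postings(const long *a, size_t na,
                      const long *b, size_t nb, long *out)
{
    size_t i = 0, j = 0, k = 0;
    while (i < na && j < nb)
        out[k++] = (a[i] <= b[j]) ? a[i++] : b[j++];
    while (i < na) out[k++] = a[i++];
    while (j < nb) out[k++] = b[j++];
    return k;
}
```

Every input element is read and written exactly once, so the merge is
bound by disk transfer rates rather than by computation.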
My current browser programs can already run on one computer while
searching files resident elsewhere on a network. The generic UNIX
version of my browser can also already be used as a "server" process and
run on one host, sending back only the (relatively) small amounts of
retrieved information that the user wants to see on a local machine. I
plan to rewrite some parts of my browser programs to make their use as
servers simpler and more efficient; this rewrite will take place along
with the other revisions to introduce new features.
Index building itself should not be a computationally infeasible
operation at a 100 MB/day data rate. My indexer programs already run at
15-20 MB/hour on a 16 MHz 68030 (Mac IIcx), and I have had reports of 60
MB/hour or better performance on faster machines with more memory and
higher performance disk drives. I also believe that there is room for a
20% - 50% speed improvement in my indexing algorithms, by applying some
of the standard quicksort enhancements discussed in many textbooks. For
storing the index files, a simple and obvious modification that I plan
to make is to give the user complete freedom to put databases and index
files in any directory, on any volume (online storage device) that is
desired. This will allow archival databases to reside on high-density
optical read-only media, while indices can be built on faster magnetic
media and can be moved to the archive when appropriate.
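A back-of-the-envelope check of the indexing budget above: at the
measured 15-20 MB/hour, 100 MB of new text per day costs roughly 5 to
7 hours of indexing time, comfortably within a day even before any
algorithmic speedups.

```c
/* Trivial throughput arithmetic for the indexing budget. */
double hours_per_day(double mb_per_day, double mb_per_hour)
{
    return mb_per_day / mb_per_hour;
}
```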
To handle databases larger than about 4 GB (2^32) will require
modifications to my programs, but (assuming that disk space is
available) not major trauma. If the pointers and counters in the index
data structures are redeclared to be 6 bytes instead of 4 bytes, for
example, it should be possible to handle up to 256 TB of text in theory.
Index file overhead will go up to about 120% instead of the current 80%
of the database text size. At this point, some simple compression
routines might be worth exploring to increase storage efficiency, if it
can be done without slowing down the retrieval process.
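The 6-byte pointer idea can be sketched as below (the helper names
are mine, for illustration): a 48-bit offset packed into six bytes
addresses 2^48 bytes = 256 TB, at 1.5 times the space of the current
4-byte pointers, which is where the rise from about 80% to about 120%
overhead comes from.

```c
/* Sketch: pack a 48-bit file offset into six bytes. */
typedef unsigned char byte6[6];

/* store a 48-bit offset, least significant byte first */
void pack48(byte6 p, unsigned long long off)
{
    for (int i = 0; i < 6; i++) {
        p[i] = (unsigned char)(off & 0xFF);
        off >>= 8;
    }
}

/* recover the offset from its six-byte form */
unsigned long long unpack48(const byte6 p)
{
    unsigned long long off = 0;
    for (int i = 5; i >= 0; i--)
        off = (off << 8) | p[i];
    return off;
}
```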
Proximity subset searching should still be straightforward to do using
my simple vector representation of the database, but as files get bigger
it may be necessary to hold the large subset vectors on disk and page
them into memory only as needed. Modern operating systems with large
virtual memory spaces should be able to handle that for the program
automatically. If I keep the current default 32-byte quantization, then
the subset vectors will still be only 0.4% as big as the total database
text size, and so even in the multigigabyte zone each subset will only
require a few megabytes of space.
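The subset-vector representation can be sketched as a plain bit
vector, one bit per 32-byte chunk of database text, which gives the
1/256 (about 0.4%) space figure quoted above. Function names here are
illustrative, not taken from the actual programs:

```c
/* Sketch: subset vector with 32-byte quantization. */
#define QUANTUM 32  /* bytes of text covered by one subset bit */

/* mark the chunk containing byte offset `off` as in the subset */
void subset_set(unsigned char *vec, unsigned long off)
{
    unsigned long chunk = off / QUANTUM;
    vec[chunk >> 3] |= (unsigned char)(1 << (chunk & 7));
}

/* test whether the chunk containing `off` is in the subset */
int subset_test(const unsigned char *vec, unsigned long off)
{
    unsigned long chunk = off / QUANTUM;
    return (vec[chunk >> 3] >> (chunk & 7)) & 1;
}
```

Because the vector is just an array of bytes, it pages in and out of
virtual memory naturally, as the paragraph above anticipates.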
CONCLUSIONS
My bottom-line evaluation is that a free-text IR system such as I have
built, plus anticipated extensions, will surely break down -- but not
until reaching file sizes in the 10 GB or larger range, or with
information arriving at a rate greater than 1 GB of text per day. Then,
problems with data transfer rates and with indexing speed may force one
to find alternative solutions. Probably a multiprocessor approach using
a partitioned database is the best tactic to take at that point.
Meanwhile, I see a lot of value still to be derived from my real-time
high-bandwidth free-text information retrieval tools, particularly as
the costs of data storage continue to decline.
/* end of part 2 of 2 */